Project: Wrangling and Analyze Data

Data Gathering

In the cells below, gather all three pieces of data for this project and load them into the notebook. Note: the method required to gather each piece of data is different.

  1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)
  2. Use the Requests library to download the tweet image predictions file (image_predictions.tsv)
  3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)
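Step 2 can be sketched with the Requests library. The URL below is a placeholder, not the real file location -- substitute the actual Udacity-hosted address of image_predictions.tsv:

```python
import requests

def download_file(url, dest_path):
    """Download a file over HTTP and save the raw bytes unmodified."""
    response = requests.get(url)
    response.raise_for_status()  # stop early on a bad HTTP status
    with open(dest_path, 'wb') as f:
        f.write(response.content)

# Placeholder URL -- replace with the actual Udacity-hosted location
# download_file('https://example.com/image-predictions.tsv', 'image_predictions.tsv')
```

Writing the raw bytes (`'wb'`) keeps the TSV exactly as served, so pandas can parse it later with `pd.read_csv(..., sep='\t')`.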
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor

# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = df_twitter_archive.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

END : GATHER DATA

The 3 dataframes are:

df_twitter_archive - contains data read from the provided CSV file (twitter_archive_enhanced.csv)

df_predictions - contains data downloaded (using Requests) from the TSV file hosted on Udacity's server

tweet_list - contains data obtained from the Twitter API (tweet_json.txt)
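The tweet_json.txt file holds one JSON object per line, so building tweet_list can be sketched as below. The fields kept here (id, retweet_count, favorite_count) are the ones this project typically needs; adjust as required:

```python
import json
import pandas as pd

def load_tweet_json(path):
    """Read one tweet JSON object per line and keep a few fields of interest."""
    rows = []
    with open(path) as f:
        for line in f:
            t = json.loads(line)
            rows.append({'tweet_id': t['id'],
                         'retweet_count': t.get('retweet_count'),
                         'favorite_count': t.get('favorite_count')})
    return pd.DataFrame(rows)
```

`dict.get` is used for the count fields so a missing key yields `None` rather than raising.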

Assessing Data

In this section, detect and document at least eight (8) quality issues and two (2) tidiness issues. You must use both visual assessment and programmatic assessment to assess the data.

Note: pay attention to the following key points when you assess the data.

First Dataset

Second Dataset

Third Dataset

Quality issues

df_twitter_archive dataframe

  1. Incorrect or missing names in the name column, such as 'a', 'an', and 'the' - all invalid names are written in lowercase letters.
  2. retweeted_status_timestamp and timestamp should be datetime instead of object (string).
  3. in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id should be integers/strings instead of float.
  4. In several columns, null values are stored as the string 'None' instead of NaN.

df_predictions dataframe

  1. Remove duplicate jpg_url entries
  2. Remove entries whose p1_dog, p2_dog, and p3_dog values are all False. These are not dogs of any kind.

tweet_list dataframe

  1. Remove retweets
  2. created_at should be of datetime datatype instead of string

Tidiness issues

  1. Combine the 4 dog category columns into a single column in the df_twitter_archive table
  2. Join df_twitter_archive, df_predictions and tweet_list tables

Cleaning Data

In this section, clean all of the issues you documented while assessing.

Note: Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of tidy data. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

Issue #1: Archive: Incorrect names or missing names in name column

Define: Remove all invalid dog names

Code
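A minimal sketch of this step, using a tiny hypothetical sample in place of the real archive. Invalid names extracted from tweet text are all lowercase, while real dog names are capitalized, so the lowercase ones can be flagged and blanked out:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the archive's name column
archive_clean = pd.DataFrame({'name': ['Cooper', 'a', 'the', 'Luna', 'an']})

# Invalid names are all-lowercase words ('a', 'an', 'the', ...); set them to NaN
invalid = archive_clean['name'].str.islower()
archive_clean.loc[invalid, 'name'] = np.nan
```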

Test

Issue #2: Archive: retweeted_status_timestamp, timestamp should be datetime instead of object (string).

Define

Correct DataType of the columns to Datetime

Code
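A sketch of the conversion, on hypothetical sample rows; `pd.to_datetime` handles the archive's string format (including the missing values in retweeted_status_timestamp, which become NaT):

```python
import pandas as pd

# Hypothetical rows with the archive's string timestamps
archive_clean = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-07-30 15:58:51 +0000'],
    'retweeted_status_timestamp': ['2017-06-01 10:00:00 +0000', None]})

for col in ['timestamp', 'retweeted_status_timestamp']:
    archive_clean[col] = pd.to_datetime(archive_clean[col])
```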

Test

Issue #3: Archive: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id should be integers/strings instead of float.

Define

Correct DataType of the columns to strings

Code

Test

Issue #4: Archive: In several columns, null values are stored as the string 'None' instead of NaN.

Define

Drop NaN values in Name column

Code
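A sketch on a hypothetical sample: replace the 'None' placeholder strings with real NaNs, then drop the rows that have no name at all:

```python
import numpy as np
import pandas as pd

archive_clean = pd.DataFrame({'name': ['Cooper', 'None', 'Luna']})

# The archive stores missing names as the string 'None'; make them real NaNs,
# then drop the rows without a name
archive_clean['name'] = archive_clean['name'].replace('None', np.nan)
archive_clean = archive_clean.dropna(subset=['name'])
```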

Test

Issue #5: df_predictions: duplicate jpg_url entries

Define

Remove duplicate jpg_url entries

Code
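A sketch on hypothetical rows; `drop_duplicates` keeps the first occurrence of each image URL:

```python
import pandas as pd

predictions_clean = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'jpg_url': ['a.jpg', 'a.jpg', 'b.jpg']})

# Keep the first row for each distinct image URL
predictions_clean = predictions_clean.drop_duplicates(subset=['jpg_url'])
```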

Test

Issue #6: df_predictions: Entries with p1_dog, p2_dog, & p3_dog all set to False. Not dogs

Define

Remove rows where p1_dog, p2_dog, and p3_dog are all False.

Code
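A sketch on hypothetical rows: a row is kept only if at least one of the three predictions says the image is a dog:

```python
import pandas as pd

predictions_clean = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'p1_dog': [True, False, False],
    'p2_dog': [False, False, True],
    'p3_dog': [False, False, False]})

# Keep a row only if any of the three predictions is a dog breed
is_dog = predictions_clean[['p1_dog', 'p2_dog', 'p3_dog']].any(axis=1)
predictions_clean = predictions_clean[is_dog]
```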

Test

Issue #7: tweet_list_clean: Many Retweets

Define

Remove retweets

Code
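A sketch on hypothetical rows, assuming (as in the Twitter API's JSON) that retweets carry a non-null retweeted_status field while original tweets do not:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: row 2 is a retweet, so it has a retweeted_status payload
tweet_list_clean = pd.DataFrame({
    'tweet_id': [1, 2],
    'retweeted_status': [np.nan, {'id': 99}]})

# Keep only original tweets, then drop the now-useless column
tweet_list_clean = tweet_list_clean[tweet_list_clean['retweeted_status'].isna()]
tweet_list_clean = tweet_list_clean.drop(columns=['retweeted_status'])
```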

Test

Issue #8: tweet_list_clean: created_at should be of datetime datatype

Define

Change datatype

Code
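A sketch of the conversion on a hypothetical row; Twitter's API returns created_at in a fixed string format that can be parsed explicitly:

```python
import pandas as pd

# Twitter's API returns created_at as e.g. 'Tue Aug 01 16:23:56 +0000 2017'
tweet_list_clean = pd.DataFrame({
    'created_at': ['Tue Aug 01 16:23:56 +0000 2017']})

tweet_list_clean['created_at'] = pd.to_datetime(
    tweet_list_clean['created_at'], format='%a %b %d %H:%M:%S %z %Y')
```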

Test

Issue #9: Tidiness: 4 dog category columns in the archive_clean table

Define

Combine 4 dog category columns in the archive_clean table

Code
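A sketch on hypothetical rows: blank out the 'None' placeholders in the four stage columns, concatenate whatever stage name remains into a single dog_stage column, and mark tweets with no stage as missing. (Note the rare tweet tagged with two stages would concatenate to e.g. 'doggopupper' and would need separate handling):

```python
import numpy as np
import pandas as pd

archive_clean = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['pupper', 'None', 'None'],
    'puppo': ['None', 'None', 'None']})

stages = ['doggo', 'floofer', 'pupper', 'puppo']
# Combine the four stage columns into one, then drop the originals
archive_clean['dog_stage'] = (archive_clean[stages]
                              .replace('None', '')
                              .agg(''.join, axis=1)
                              .replace('', np.nan))
archive_clean = archive_clean.drop(columns=stages)
```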

Test

Issue #10: Tidiness: 3 separate tables

Define

Join the archive_clean, predictions_clean, and tweet_list_clean tables

Code
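A sketch of the join on tiny hypothetical tables; inner merges on tweet_id keep only tweets present in all three sources:

```python
import pandas as pd

# Tiny hypothetical stand-ins for the three cleaned tables
archive_clean = pd.DataFrame({'tweet_id': [1, 2], 'name': ['Cooper', 'Luna']})
predictions_clean = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['clumber', 'pug']})
tweet_list_clean = pd.DataFrame({'tweet_id': [1, 2], 'favorite_count': [10, 20]})

# Inner joins keep only tweets present in all three sources
df_master = (archive_clean
             .merge(predictions_clean, on='tweet_id', how='inner')
             .merge(tweet_list_clean, on='tweet_id', how='inner'))
```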

Test

Storing Data

Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
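The save itself is one line; a minimal sketch with a stand-in DataFrame (`index=False` keeps the row index out of the file):

```python
import pandas as pd

# Minimal stand-in for the cleaned master DataFrame
df_master = pd.DataFrame({'tweet_id': [1], 'name': ['Cooper']})

df_master.to_csv('twitter_archive_master.csv', index=False)
```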

Analyzing and Visualizing Data

In this section, analyze and visualize your wrangled data. You must produce at least three (3) insights and one (1) visualization.

Insights:

  1. Cooper is the most used dog name

  2. Clumber is the dog breed with the highest rating numerator

  3. Pupper is the most common dog category
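Insights of this kind can be computed with `value_counts` and `groupby`; a sketch on a tiny hypothetical stand-in for the master table (the sample values below are illustrative, not the real results):

```python
import pandas as pd

# Tiny hypothetical stand-in for the cleaned master table
df_master = pd.DataFrame({
    'name': ['Cooper', 'Cooper', 'Luna'],
    'p1': ['clumber', 'pug', 'clumber'],
    'rating_numerator': [27, 10, 12],
    'dog_stage': ['pupper', 'doggo', 'pupper']})

most_common_name = df_master['name'].value_counts().idxmax()
most_common_stage = df_master['dog_stage'].value_counts().idxmax()
# Breed (first prediction) with the highest mean rating numerator
top_breed = df_master.groupby('p1')['rating_numerator'].mean().idxmax()
```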

Visualization
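One possible visualization, a bar chart of dog-stage counts; the counts below are hypothetical placeholders for `df_master['dog_stage'].value_counts()`:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts standing in for df_master['dog_stage'].value_counts()
stage_counts = pd.Series({'pupper': 230, 'doggo': 80, 'puppo': 25, 'floofer': 10})

ax = stage_counts.plot(kind='bar')
ax.set_xlabel('Dog stage')
ax.set_ylabel('Number of tweets')
ax.set_title('Pupper is the most common dog stage')
plt.tight_layout()
plt.savefig('dog_stage_counts.png')
```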